An enormous amount of genetic sequence data has been obtained with gel-based DNA sequencing
methods [HN1]. To understand the function of the genes and their health implications, genetic variation
over vast numbers of cells, tissues, individuals, and organisms must be examined. The complex
expression patterns of the estimated 100,000 genes comprising the human genome [HN2] and the
intricate developmental signals that define cellular fate will need to be understood. The magnitude of 
this analysis will dwarf the task of obtaining the primary sequence of the human genome, and so 
efficient means to experimentally access vast amounts of genetic information are critically needed. 

Conventionally, researchers use analytical techniques to resolve sequence at the single-nucleotide level
(1). In contrast, biological systems read, store, and modify genetic information using the simple rules of
molecular recognition. Each DNA strand carries the capacity to recognize a uniquely complementary
sequence through base pairing. The process of recognition, or hybridization [HN3], is highly parallel,
and every sequence in a complex mixture can, in principle, be interrogated at the same time. Application
of this highly desirable concept to sequence analysis has awaited new combinatorial technologies to
generate high-density ordered arrays of large numbers of oligonucleotide probes (2).

At Affymetrix [HN4], we developed ways to synthesize and assay biological molecules in a highly
dense parallel format. Integration of two key technologies forms the cornerstone of the method (3). The
first technology, light-directed combinatorial chemistry [HN5], enables the synthesis of hundreds of
thousands of discrete compounds at high resolution in precise locations on a substrate. The second, laser
confocal fluorescence scanning [HN6], permits measurement of molecular interactions on the array. This
technology is now commercial, with complete systems exported to dozens of sites. 

Light-directed chemical synthesis employs two mature technologies: photolithography [HN7] and
solid-phase synthesis [HN8]. Synthetic linkers modified with photochemically removable protecting
groups are attached to a glass substrate. Light is directed through a photolithographic mask to specific
areas of the surface to produce localized photodeprotection. The first of a series of chemical building
blocks (hydroxyl-protected deoxynucleosides, for example) is incubated with the surface, and chemical
coupling occurs at those sites that have been illuminated in the preceding step. Next, light is directed to a
different region of the substrate by a new mask, and the chemical cycle is repeated. Highly efficient
strategies can be used to synthesize any arbitrary probe at any discrete, specified location on the array in
a minimum number of chemical steps. For example, the complete set of $4^N$ polydeoxynucleotides of
length N, or any subset of this set, can be synthesized in only $4 \times N$ chemical cycles. Thus,
given a reference sequence, a DNA chip can be designed that consists of a highly dense array of
complementary probes with no restriction on design parameters. The amount of nucleic acid information
encoded on the chip in the form of different probes is limited only by the physical size of the array and
the achievable lithographic resolution. Current bulk manufacturing methods allow for ~409,000
polydeoxynucleotides to be synthesized on 1.28-cm by 1.28-cm chips.
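
The cycle count can be illustrated with a short sketch. It assumes the simplest possible schedule, in which every probe position is built with four masked coupling steps, one per base. The function names and the representation of a mask as a list of cell indices are illustrative assumptions, not details taken from the manufacturing process, and the chemistry itself is not modeled.

```python
from itertools import product

BASES = "ACGT"

def synthesis_schedule(probes):
    """Return a list of (position, base, cells) coupling cycles.

    Each cycle opens the cells whose probe needs `base` at `position`
    and couples that base there. Four cycles are scheduled per position,
    so the total is 4 * N for probes of length up to N.
    """
    length = max(len(p) for p in probes)
    schedule = []
    for position in range(length):
        for base in BASES:
            # The "mask" is modeled as the list of cells exposed in this cycle.
            cells = [i for i, p in enumerate(probes)
                     if position < len(p) and p[position] == base]
            schedule.append((position, base, cells))
    return schedule

def build(probes):
    """Replay the schedule and return the sequence grown in each cell."""
    grown = [""] * len(probes)
    for _, base, cells in synthesis_schedule(probes):
        for i in cells:
            grown[i] += base
    return grown

# All 4**3 = 64 probes of length 3 are produced in 4 * 3 = 12 cycles.
all_trimers = ["".join(t) for t in product(BASES, repeat=3)]
cycles = synthesis_schedule(all_trimers)
print(len(all_trimers), len(cycles))  # prints: 64 12
```

The number of cycles depends only on the probe length, not on how many different probes share the array, which is why the probe count can grow exponentially while the chemistry grows linearly.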

Arrayed for sequencing. DNA chip fabricated by photolithography.



Photolithography allows the construction of probe arrays with extremely high information content. 
Because the array is constructed on glass, it can be inverted and mounted in a temperature-controlled 
hybridization chamber. A target sequence is fluorescently tagged and then injected into the chamber, 
where the target hybridizes to its complementary sequences on the array. Laser excitation [HN9] enters
through the back of the array, focused at the interface of the array surface and the target solution. 
Fluorescence emission is collected by a lens and passes through a series of optical filters to a sensitive 
detector. By simply scanning the laser beam or translating the array, or a combination of both, a 
quantitative two-dimensional fluorescence image of hybridization intensity is rapidly obtained. 
Commercial instrumentation for controlling the hybridization and scanning of the arrays, and software 
for image and data analysis have been developed. This approach requires only minute consumption of 
chemical reagents and minute preparations of biological samples. 

An array of oligonucleotides complementary to subsequences of a target sequence can be used to 
identify a target sequence, measure its amount or relative expression level, and detect differences 
between the target and a reference sequence. Many different arrays can be designed for these purposes, 
and the applications appear to be only limited by imagination. The system consists of chips, a 
hybridization station to control hybridization, and a reader and software to access the chip data. Specific 
chip products for expression analysis, HIV array resistance screening, and gene resequencing are already 
on the market. Two versions of commercial readers are available: a first-generation system from
Molecular Dynamics [HN10] as well as a recently released high-performance system from
Hewlett-Packard [HN11]. Chip production is now in a scalable format. We are now producing ~5,000 to
10,000 chips per month, and we expect a large increase in production in the near future.

To fully understand gene expression, gene function, and the subtleties of regulation, the quantitative 
levels of expressed genes under various conditions must be assayed. In addition, if quantitative 
"snapshots" of gene expression can be captured, the dynamics of cellular pathways can then be 
deciphered. Recently, Lockhart et al. (4) published methods for the quantitative parallel measurement of
cellular messenger RNA for gene sequences encoded on the chip solely from primary sequence data.
RNAs present at a frequency of 1:300,000 were unambiguously detected with a quantitative assay
spanning three to four orders of magnitude in concentration. Currently, Lockhart and colleagues have
developed chips containing the complete open reading frames from the yeast genome [HN12], a series of
"custom" chips with hundreds to thousands of full-length genes or fragments from various databases, as
well as "standard" chips containing more than 6,500 genes from the public databases. An expression
chip with more than 50,000 expressed sequence tags from the public databases [HN13] is currently in
development.

This method complements other noncombinatorial array-based methods that involve the serial spotting
of multiple clones or complementary DNAs (cDNAs) onto nylon membranes or modified microscope 
slides (5). These latter methods are inherently parallel for the analysis of sequence information, and they 
complement the probe-based arrays in their ability to use previously nonsequenced biological materials 
for the expression analysis. However, hybridization to large cDNA or PCR products can be 
thermodynamically more stable than to a series of shorter discriminating oligonucleotides and can yield
confusing cross-hybridization signals when examining closely related genes, gene families, or other 
variants. Nonetheless, as experimentation continues, new ways around these problems will be found. In 
combination, the synthetic and mechanical techniques build a powerful new platform for parallel gene 
expression and offer large time and labor improvements over traditional cDNA library sequencing 
methods [HN14].

Understanding the relationship between genotype and phenotype is a critical technical bottleneck in
modern genetics. For example, consider examining 50 kilobases (kb) of coding sequence for 1000
individuals. The genes are known, but the prevalence, location, and identity of polymorphisms are not. 
The methods of conventional gel-based sequencing that are so effective in the initial gene sequencing are 
not efficient for this task. Comparative gel-based sequencing is indistinguishable from a de novo 
sequencing reaction, and so the de novo sequencing reaction must be carried out for all 50 kb over the 
1000 individuals, or roughly 50 Mb of sequence. This illustrates the stark contrast of technical 
requirements when compared to a chip-based approach. 

Chee et al. (6) recently showed how the entire human mitochondrial genome [HN15] can be sequenced
with high accuracy in a single hybridization experiment. A total of ~135,000 oligonucleotide probes
were used to check the sequence of ~33 kb (forward and reverse strands) of the mitochondrial genome in
one reaction. In addition, 179 of the 180 polymorphisms present in control samples were correctly
detected. Two-color comparative sequence analysis experiments were performed that demonstrated how
mutations or polymorphisms could be detected on a very large scale, making it now possible to use the
technology for large-scale polymorphism screening efforts. At the current state of development, 1.28-cm
by 1.28-cm chips can contain enough probes to scan anywhere from 32 kb to more than several hundred
kilobases of sequence, depending on the specific chip design and accuracy requirements of the screen. 
Put in the context of the previously posed experiment, 1000 chips each containing 50 kb could easily 
and quickly perform the comparative sequence analysis. 
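
The scale of the comparison can be checked with a few lines of arithmetic. The sketch below assumes four probes per queried base, which is roughly what ~135,000 probes for ~33 kb implies; real designs may use more or fewer probes per base, so the numbers are only indicative and the helper names are hypothetical.

```python
PROBES_PER_BASE = 4  # assumption: one probe for each possible base at a position

def probes_needed(bases_to_scan, probes_per_base=PROBES_PER_BASE):
    """Probes required to query every base once."""
    return bases_to_scan * probes_per_base

def bases_covered(probes_available, probes_per_base=PROBES_PER_BASE):
    """Bases that a given number of probes can query."""
    return probes_available // probes_per_base

gel_bases = 50_000 * 1_000           # 50 kb in each of 1000 individuals
mito_probes = probes_needed(33_000)  # ~33 kb, forward and reverse strands
chip_bases = bases_covered(409_000)  # ~409,000 probes on one chip

print(gel_bases)    # 50000000 bases, i.e. 50 Mb of gel-based sequencing
print(mito_probes)  # 132000, close to the ~135,000 probes reported
print(chip_bases)   # 102250 bases per chip at four probes per base
```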




Making the matrix. Glass wafer is divided into DNA chips that contain the probe arrays. 



Designing arrays to detect specific allelic variation is relatively straightforward. In addition to using chip
designs appropriate to scan a sequence (as in the polymorphism application), blocks of probes can be
dedicated to the specific detection of known allelic variation. Cronin et al. (7) designed a chip-based
assay to detect multiple mutations in the CFTR gene; Kozal et al. (8) targeted HIV [HN16]; Hacia et al.
(9) examined the BRCA1 gene [HN17]; and a number of new designs are in development for
examination of p53 and cytochrome p450, and for microbial identification and antibiotic resistance. The
amount of data coded on the array is limited only by the number of probes used per data point, the 
available synthesis area, and the synthesis resolution. 

Recently, in collaboration with Lander's group at the Whitehead Institute (Cambridge, MA) [HN18],
Chee and Lipshutz have initiated work on a single nucleotide polymorphism (SNP) mapping chip (10).
The immediate objective is to identify the common polymorphisms (those of ~20 to 50% frequency)
contained within the mapped sequence-tagged site collection at the Whitehead Institute. These then form 
the basis set of biallelic markers that can be amplified from genomic DNA and applied to a probe array. 
Similar to the design used to detect allelic variation in the CFTR gene, blocks of probes are dedicated to 
each polymorphic form of the marker. This allows a straightforward detection of whether the sample is 
homozygous or heterozygous for each marker. These experiments offer enormous savings in time and 
labor, compared to standard gel-based microsatellite methods. Currently, prototype mapping chips 
containing ~500 markers are being produced, with plans to expand to a 2000-marker chip by the end of
the year. These chips will be used for a number of applications, including studies of linkage, association, 
and loss of heterozygosity measurements. 
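
How a pair of allele-specific probe blocks could be turned into a genotype can be sketched as follows. This is only an illustration: the summary of each block as a single signal value, the thresholds, and the function name are all assumptions made for the example and do not describe the scoring actually used on the mapping chips.

```python
def call_genotype(signal_a, signal_b, background=0.0,
                  het_window=(0.3, 0.7), min_signal=10.0):
    """Call a biallelic marker from the signals of its two probe blocks.

    signal_a, signal_b: summary intensity of the block for each allele.
    het_window: range of the allele-A signal fraction treated as
        heterozygous (an assumed, uncalibrated threshold).
    min_signal: minimum combined net signal required to make a call.
    """
    a = max(signal_a - background, 0.0)
    b = max(signal_b - background, 0.0)
    total = a + b
    if total < min_signal:
        return "no call"
    fraction_a = a / total
    low, high = het_window
    if fraction_a > high:
        return "A/A"   # homozygous for allele A
    if fraction_a < low:
        return "B/B"   # homozygous for allele B
    return "A/B"       # heterozygous

print(call_genotype(100, 5))   # A/A
print(call_genotype(50, 45))   # A/B
print(call_genotype(4, 90))    # B/B
```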

The challenges of linking sequence variation to biological function are many. When the sequence is 
available, chips containing every gene in the human genome (chips for other genomes such as yeast have 
already been manufactured) can be produced, allowing genome-wide expression analysis. This should 
have a profound influence on the ability to elucidate the metabolic and disease pathways [HN19] of the
cell under a variety of developmental and environmental perturbations and have immediate applications
in toxicology studies and pharmaceutical development [HN20]. Screening chips will allow the
databasing of large numbers of polymorphisms, and SNP chips will uncover how they are associated
with disease. Chips providing insight into the genetics of model organisms have already been developed
with strategies that will no doubt expand to include any organism of interest (11). DNA chip technology
moves genetic sequence analysis away from serial gel-based methods to a massively parallel screening 
format. In time, technology will be needed to make this same paradigm shift for the hundreds of 
thousands of proteins, chemical messengers, and other molecular components of life. 

TechWire Forum: www.sciencemag.org/dmail.cgi?53241

HyperNotes

Related Resources on the World Wide Web 

Numbered Hypernotes 

1. This site, located on the Web server of George Church's laboratory at Harvard University, presents
a convenient jumping point to all of the Web genome sequencing projects. Links to most of the
organisms that are currently being sequenced can be found here, from worms to humans.

2. The Genome Database is a major site on the Web where different types of data collected on the 
human genome are being archived. This site collects and organizes regions of the human genome, 
clones, amplimers (PCR markers), breakpoints, cytogenetic markers, fragile sites, expressed 
sequence tags (ESTs), syndromic regions, and contigs. It also houses current maps of the human 
genome, including cytogenetic maps, linkage maps, radiation hybrid maps, content contig maps, 
and integrated maps. 



3. This primer on molecular genetics has been created by the U.S. Department of Energy. It gives 
excellent background data on the molecular basis of DNA, genes, and chromosomes and is suited 
for both beginners and experts.

4. The home page of Affymetrix gives a listing of the current GeneChip products used for chip-based
DNA detection as well as background information on the research activities of the company. 

5. Interested in learning more about chemistry or participating in online discussions with other 
chemists? The home page of the Royal Society of Chemistry provides a wealth of information,
including a list of chemistry servers on the Internet. 

6. To learn more about confocal microscopy techniques, see this site presented by L. Ladic from the 
University of British Columbia. There is a basic introduction of the techniques and tips for 
specimen preparation. 

7. Photolithography is used in industry to produce the integrated circuits that serve as the brains of 
computers. This tutorial provides information on how the process works and on how it is 
evolving. 

8. This site collects references to the most recent papers published in the field of solid-phase 
synthesis and combinatorial chemistry. It also has a large collection of links to companies that sell 
products for solid-phase synthesis. 

9. The Beckman Laser Institute and Medical Clinic at the University of California at Irvine has 
collected a large amount of basic information on the application of lasers to research and 
medicine. Examples include how lasers are used to remove wrinkles or improve vision. 

10. The home page of Molecular Dynamics provides information on advanced imaging techniques for 
biology and chemistry. The online discussion is a good way to get the latest tricks and tips on 
using the imaging techniques. 

11. The Hewlett-Packard Web site has some interesting surprises, including a focus on the basic 
science of chemistry. Their Internet chemistry resource list is useful to basic scientists as well as 
commercial researchers.

12. To learn about almost every database available on the Internet for the yeast Saccharomyces
cerevisiae, turn to this site at the National Institutes of Health. Here you can get any portion of the
completely sequenced genome as well as submit queries to the XREF database.

13. The National Center for Biotechnology Information at the National Institutes of Health is the 
gateway to the world's public databases housing nucleic acid and protein sequences.

14. Traditional high-throughput sequencing relies heavily on the use of automated sequencing 
equipment. The Applied Biosystems machine is currently one of the most frequently used. This
Web site describes some of its features. 

15. MITOMAP, provided by A. M. Kogelnik, M. T. Lott, M. D. Brown, S. B. Navathe, and D.C. 
Wallace of Emory University and the Georgia Institute of Technology (Atlanta, GA), serves as a 
home base for studies on human mitochondrial DNA (mtDNA). It houses the entire mtDNA
sequence, mutation collections, and population data. 

16. I. Fenton, of the University of Wales College of Medicine (UK), maintains the Human Gene 
Mutation Database at the Institute of Medical Genetics in Cardiff, which collects and maintains 
data on many of the known mutations in human genes that cause diseases, including cystic
fibrosis.

17. The Breast Cancer Information Core houses a mutation database of known mutations in the gene 
BRCA1. Although the site can be accessed by members only, membership is free and open to the 
scientific public by following the guidelines presented and filling out their form. 

18. The home page of the Whitehead Institute for Biomedical Research provides this link to their 
genome center, a joint project with the Massachusetts Institute of Technology. 

19. V. A. McKusick and his colleagues at Johns Hopkins University and elsewhere provide the Online
Mendelian Inheritance in Man site, which collects a wealth of information about individual
genetic defects that cause human diseases. The short summary articles provide a good way to 
quickly find out about the current knowledge of a specific genetic defect. 

20. One way to find out what the different pharmaceutical companies are up to is by checking out 
their Web sites. Links to many of them are available at Pharm Web.

Rapid access to genetic information is central to the revolution taking place in molecular genetics. The
simultaneous analysis of the entire human mitochondrial genome is described here. DNA arrays 
containing up to 135,000 probes complementary to the 16.6-kilobase human mitochondrial genome were 
generated by light-directed chemical synthesis. A two-color labeling scheme was developed that allows 
simultaneous comparison of a polymorphic target to a reference DNA or RNA. Complete hybridization 
patterns were revealed in a matter of minutes. Sequence polymorphisms were detected with single-base 
resolution and unprecedented efficiency. The methods described are generic and can be used to address a 
variety of questions in molecular genetics including gene expression, genetic linkage, and genetic 
variability. 

Affymetrix, 3380 Central Expressway, Santa Clara, CA 95051, USA. 



A central theme in modern genetics is the relation between genetic variability and phenotype. To
understand genetic variation and its consequences on biological function, an enormous effort in
comparative sequence analysis will need to be carried out. Conventional nucleic acid sequencing
technologies make use of analytical separation techniques to resolve sequence at the single nucleotide
level (1, 2). However, the effort required increases linearly with the amount of sequence. In contrast,
biological systems read, store, and modify genetic information by molecular recognition (3). Because 
each DNA strand carries with it the capacity to recognize a uniquely complementary sequence through 
base pairing, the process of recognition, or hybridization, is highly parallel, as every nucleotide in a large 
sequence can in principle be queried at the same time. Thus, hybridization can be used to efficiently 
analyze large amounts of nucleotide sequence. In one proposal, sequences are analyzed by hybridization 
to a set of oligonucleotides representing all possible subsequences (4). A second approach, used here, is 
hybridization to an array of oligonucleotide probes designed to match specific sequences. In this way the
most informative subset of probes is used. Implementation of these concepts relies on recently 
developed combinatorial technologies to generate any ordered array of a large number of oligonucleotide 
probes (5). 

The fundamentals of light-directed oligonucleotide array synthesis have been described (5, 6). Any
probe can be synthesized at any discrete, specified location in the array, and any set of probes composed
of the four nucleotides can be synthesized in a maximum of $4N$ cycles, where N is the length of the
longest probe in the array. For example, the entire set of $\sim 10^{12}$ 20-nucleotide oligomer probes, or any
desired subset, can be synthesized in only 80 coupling cycles. The number of different probes that can be
synthesized is limited only by the physical size of the array and the achievable lithographic resolution.

An array consisting of oligonucleotides complementary to subsequences of a target sequence can be
used to determine the identity of a target sequence, measure its amount, and detect differences between
the target and a reference sequence. Many different arrays can be designed for these purposes. One such
design, termed a 4L tiled array, is depicted in Fig. 1A. In each set of four probes, the perfect complement
will hybridize more strongly than mismatched probes. By this approach, a nucleic acid target of length L
can be scanned for mutations with a tiled array containing $4L$ probes. For example, to query the
16,569 base pairs (bp) of human mitochondrial DNA (mtDNA), only 66,276 probes of the possible $\sim 10^{9}$
15-nucleotide oligomers need to be used.
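
The construction of a 4L tiling can be sketched in a few lines. The sketch assumes that probes are written 3' to 5' so that they line up with a target read 5' to 3', and that the substitution position is counted from the probe's 3' end; the function name, the toy target, and the decision to skip positions too close to the ends of a linear target are illustrative choices rather than details of the published design.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"

def tiled_array(target, length=15, sub_pos=7):
    """Yield (query_index, substitution_base, probe) for a 4L tiling.

    target: sequence read 5'->3'.
    length: probe length.
    sub_pos: substitution position, 1-based, counted from the probe's
        3' end. Probes are returned written 3'->5'.
    """
    left = sub_pos - 1        # probe bases on the 3' side of the query base
    right = length - sub_pos  # probe bases on the 5' side of the query base
    for q in range(left, len(target) - right):
        window = target[q - left:q + right + 1]
        match = [COMPLEMENT[b] for b in window]
        for base in BASES:
            probe = match.copy()
            probe[left] = base  # vary only the substitution position
            yield q, base, "".join(probe)

# A 15-base toy target yields one query position and one set of four probes.
probes = list(tiled_array("ACTGTATCCGACATC"))
for q, base, probe in probes:
    print(q, base, "3'-" + probe)

print(4 * 16_569)  # probes in a 4L tiling of the 16,569-bp mtDNA: 66276
```

In each set of four, exactly one probe is the perfect complement of the target window; the other three carry a single mismatch at the substitution position.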



Fig. 1. (A) Design of a 4L tiled array. Each position in the target sequence (uppercase 
letters) is queried by a set of four probes on the chip (lowercase letters), identical except 
at a single position, termed the substitution position, which is either A, C, G, or T (blue 
indicates complementarity, red a mismatch). Two sets of probes are shown, querying 
adjacent positions in the target. (B) Effect of a change in the target sequence. The probes 
are the same as in (A), but the target now contains a single-base substitution (base C, 
shown in green). The probe set querying the changed base still has a perfect match (the G 
probe). However, probes in adjacent sets that overlap the altered target position now 
have either one or two mismatches (red) instead of zero or one, because they were 
designed to match the target shown in (A). (C) Hybridization to a 4L tiled array and
detection of a base change in the target. The array shown was designed to the mt1
sequence. (Top) Hybridization to mt1. The substitution used in each row of probes is
indicated to the left of the image. The target sequence can be read 5' to 3' from left to
right as the complement of the substitution base with the brightest signal. With hybridization to mt2
(bottom), which differs from mt1 in this region by a T→C transition, the G probe at position 16,493 is
now a perfect match, with the other three probes having single-base mismatches (A, 5; C, 3; G, 37; T,
4 counts). However, at flanking positions, the probes have either single- or double-base mismatches,
because the mt2 transition now occurs away from the query position.



The use of a tiled array of probes to read a target sequence is illustrated in Fig. 1C. A tiled array of
15-nucleotide oligomers varied at position 7 from the 3' end ($P^{15,7}$) was designed and synthesized for
mt1, a cloned sequence containing 1311 bp spanning the control region of mtDNA (8, 9, 10, 11). The
upper panel of Fig. 1C shows a portion of the fluorescence image of an array hybridized with
fluorescein-labeled mt1 RNA (12). The base sequence can be read by comparing the intensities of the
four probes within each column. For example, the column for position 16,493 consists of the four
probes 3'-TGACATAGGCTGTAG, 3'-TGACATCGGCTGTAG, 3'-TGACATGGGCTGTAG, and
3'-TGACATTGGCTGTAG. The probe with the strongest signal is the probe with the A substitution
(A, 49 counts; C, 8 counts; G, 15 counts; and T, [illegible] counts, where the background is 2 counts),
identifying the base at position 16,493 as U in the RNA transcript. Continuing the process, the sequence
at each position can be read directly from the hybridization intensities.
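
Reading a base from one column of four probes can be sketched as below. The rule used here (take the brightest probe after background subtraction and require it to exceed the runner-up by an assumed factor) is a deliberately simple stand-in chosen for illustration; it is not the base identification algorithm used in this work. The counts in the example are those quoted in the Fig. 1 caption for the mt2 target at position 16,493, and the result is reported in the DNA alphabet.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def call_base(counts, background=2.0, min_ratio=2.0):
    """Call the target base from the four probe intensities of one column.

    counts: maps each probe substitution base to its fluorescence counts.
    background: counts subtracted from every probe before comparison.
    min_ratio: assumed factor by which the brightest probe must exceed
        the second brightest; otherwise "N" (no call) is returned.
    """
    net = {base: max(c - background, 0.0) for base, c in counts.items()}
    ranked = sorted(net, key=net.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if net[best] == 0 or net[best] < min_ratio * net[runner_up]:
        return "N"
    # The target base is the complement of the best-matching probe base.
    return COMPLEMENT[best]

mt2_counts = {"A": 5, "C": 3, "G": 37, "T": 4}
print(call_base(mt2_counts))  # C, consistent with the T-to-C transition in mt2
```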

The effect on the array hybridization pattern caused by a single base change in the target is illustrated in
Fig. 1B, and the detection of a single-base polymorphism is shown in the lower panel of Fig. 1C. The
target was mt2, which differs from mt1 in this region by a T-to-C transition at position 16,493.
Accordingly, the probe with the G substitution (third row) displays the strongest signal.
Because the tiled array was designed to complement mt1, the hybridization intensities of neighboring
probes that overlap position 16,493 are also affected by the change in target sequence. The hybridization
signals of 15 probe sets of the 15-nucleotide oligomer tiled array are perturbed by a single base change
in the target sequence. In the $P^{15,7}$ array, each of the probes querying the eight positions to the left and six
positions to the right of the polymorphism contains at least one mismatch to the target. The result is a
characteristic loss of signal or a "footprint" for the probes flanking a mutation position. Of the four
probes querying each position, the loss of signal is greatest for the one designed to match mt1. We
denote the subset of probes with zero mismatches to the reference sequence as $P^0$.

A comparison of $P^0$ hybridization signals from a target to those from a reference is ideally obtained by
hybridizing both samples to the same array. We therefore developed a two-color labeling and detection 
scheme in which the reference is labeled with phycoerythrin (red), and the target with fluorescein (green) 
(13). By processing the reference and target together, experimental variability during the fragmentation, 
hybridization, washing, and detection steps is minimized or eliminated. In addition, during 
cohybridization of the reference and target, competition for binding sites results in a slight improvement 
in mismatch discrimination. Array hybridization is highly reproducible, and comparative analysis of data 
obtained from separate but identically synthesized arrays is also effective. 

The two-color approach was tested by analyzing a 2.5-kb region of mtDNA that spans the tRNA^Glu,
cytochrome b, tRNA^Thr, tRNA^Pro, control region, and tRNA^Phe DNA sequences (14). A $P^{20,9}$ array
(20-nucleotide oligomer probes varied at position 9 from the 3' end) was designed to match the mt1
target (that is, $P^0$ sequence = mt1). The mt1 reference (red) and a polymorphic target sample (green)
were pooled and hybridized simultaneously to the array. Differences between the target and reference
sequences were identified by comparing the scaled red and green $P^0$ hybridization intensities (15). The
marked decrease in target hybridization intensity, over a span of ~20 nucleotides, is shown for a
single-base polymorphism at position 16,223 (Fig. 2A). The footprint is enlarged when two
polymorphisms occur in close proximity (within ~20 nucleotides) (Fig. 2B). When polymorphisms are
clustered, the size of the footprint depends on the number of polymorphisms and their separation (Fig.
2C).
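
The expected extent of a footprint follows directly from the probe length and substitution position, and can be sketched as below. The sketch assumes that "left" means lower sequence coordinates and that the substitution position is counted from the probe's 3' end with the probe aligned to a target read 5' to 3'; the function names are illustrative.

```python
def footprint(poly_pos, length, sub_pos):
    """Inclusive range of query positions whose probes overlap poly_pos.

    A probe querying position q covers q - (sub_pos - 1) through
    q + (length - sub_pos), so it overlaps poly_pos when q lies between
    poly_pos - (length - sub_pos) and poly_pos + (sub_pos - 1).
    """
    return poly_pos - (length - sub_pos), poly_pos + (sub_pos - 1)

def merged_footprints(positions, length, sub_pos):
    """Merge overlapping or adjacent footprints of several polymorphisms."""
    spans = sorted(footprint(p, length, sub_pos) for p in positions)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# 20-mer probes varied at position 9: 11 positions to the left, 8 to the right.
print(footprint(16223, 20, 9))              # (16212, 16231)
# Two changes 10 bases apart give one enlarged footprint of 30 positions.
print(merged_footprints([100, 110], 20, 9))  # [(89, 118)]
```

With 15-mer probes varied at position 7, the same function gives eight positions to the left and six to the right of the change.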



Fig. 2. Detection of base differences in a 2.5-kb region by
comparison of scaled $P^0$ hybridization intensity patterns between a
sample (green) and a reference (red) sequence. (A) Comparison of
sequence ief007 to mt1. In the region shown, there is a single-base
difference between the two sequences, located at position 16,223 (C
in mt1, T in ief007). This results in a "footprint" spanning ~20
positions, 11 to the left and 8 to the right of position 16,223, in which
the ief007 $P^0$ intensities are decreased by a factor of more than 10 on
average relative to the mt1 intensities. The predicted footprint
location is indicated by the gray bar, and the location of the
polymorphism is shown by a vertical black line within the bar. The size of a footprint changes with
probe length, and its relative position with substitution position (not shown). (B) Comparison of
sequence ha001 to mt1. The ha001 target has four polymorphisms relative to mt1. The $P^0$ intensity
pattern clearly shows two regions of difference between the targets. Each region contains two or more
differences, because in both cases the footprints are longer than 20 positions and therefore are too
extensive to be explained by a single-base difference. The effect of competition can be seen by
comparing the mt1 intensities in the ief007 and ha001 experiments: The relative intensities of mt1 are
greater in (B) where ha001 contains $P^0$ mismatches but ief007 does not. (C) The ha004 sequence has
multiple differences to mt1, resulting in a complex pattern extending over most of the region shown.
Thus, differences are clearly detected. Because hybridization intensities are extremely
sequence-dependent, each of the mitochondrial sequences can also be identified simply by its
hybridization pattern.



To read polymorphisms accurately, we developed an algorithm that addresses the issue of multiple 
mismatches. The algorithm performs base identification but also flags regions of ambiguity caused by 
multiple mismatches. These regions are easily identified by the presence of a large footprint (Fig. 2, B 
and C) or by two or more bases identified as differing from $P^0$ within the span of a single probe.
Discrepancies between base identifications and footprint patterns are also flagged for further analysis
(for example, a $P^0$ footprint in which no polymorphism is identified; such a pattern is typical of a
deletion). Thus, base identifications are valid only for unflagged regions. In flagged regions, the 
presence of sequence differences is detected, but no attempt is made to identify the sequence without 
further analysis. 
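
The flagging rules described above can be sketched as a small function. This is not the authors' algorithm, whose details are not given here; it only encodes the three stated conditions, and it assumes that base differences and footprint ranges have already been extracted from the hybridization data. All names and the example coordinates are hypothetical.

```python
def flag_regions(calls, footprints, probe_length):
    """Return (start, end, reason) tuples for regions needing further analysis.

    calls: sorted positions where the called base differs from the reference.
    footprints: (start, end) inclusive ranges of reduced reference-match signal.
    probe_length: length of the probes in the tiled array.
    """
    flags = []
    for start, end in footprints:
        size = end - start + 1
        inside = [p for p in calls if start <= p <= end]
        if size > probe_length:
            # Too long to be explained by a single-base difference.
            flags.append((start, end, "large footprint"))
        elif not inside:
            # Signal loss without an identified substitution.
            flags.append((start, end, "footprint without identified change"))
    for a, b in zip(calls, calls[1:]):
        if b - a < probe_length:
            # Both differences fall within the span of a single probe.
            flags.append((a, b, "differences within one probe span"))
    return flags

example = flag_regions(
    calls=[16223, 16300, 16310],
    footprints=[(16212, 16231), (16289, 16318), (16400, 16419)],
    probe_length=20,
)
for flag in example:
    print(flag)
```

In this example the isolated change at 16,223 sits in a footprint of normal size and is left unflagged, so its base identification would be accepted.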

Sequence analysis was carried out on the 2.5-kb target from 12 samples. A total of 30,582 bp containing
180 substitutions relative to mt1 were analyzed. Ninety-eight percent of the sequence was
unambiguously assigned by a Bayesian base identification algorithm (16). Of this 98%, which contained
both wild-type sequence and a high proportion of single-base footprints such as the example shown in
Fig. 2A, 29,878 out of 29,879 bp were identified correctly (17). The remaining 2% of the sequence,
which contained the multiple substitution footprints (such as those shown in Fig. 2, B and C), was 
flagged for further analysis. Of the 649 bp composing this 2%, 643 bp were located in or immediately 
adjacent to footprints (18). In all, 179 out of the 180 polymorphisms were unambiguously detected:
126 out of 127 were identified correctly in the unflagged regions, and 53 polymorphisms occurring in the
flagged regions were detected as footprints. There were no unflagged false-positive base identifications,
and only one false-positive footprint. These figures can be considered to be "worst case" estimates for
the type of array and target used. The $P^0$ sequence represents a Caucasian haplotype, and our sample set
included eight African samples having a large number of clustered differences to $P^0$. Furthermore, the
variation in the hypervariable part of the control region is much higher than for the rest of the 
mitochondrial genome and for nuclear genes in general (Fig. 2 shows comparisons to African samples in 
this region). 

The determination of a complete human mitochondrial DNA sequence more than 15 years ago has had a 
tremendous influence on studies of human origins and evolution and the role of mutations in 
degenerative diseases (8, 10, 19). Because of the cost and difficulty of conventional sequence analysis, 
most subsequent sequencing studies have focused only on two small hypervariable regions totaling 
~600 bp (9). However, access to the entire genome is required for a full understanding of the governing 
genetics. We therefore designed a P25,13 tiling array for the mitochondrial genome. The array contains a 
total of 136,528 synthesis cells, each ~35 µm by 35 µm in size (Fig. 3). In addition to a 4L tiling across 
the genome, the array contains a set of probes representing a single-base deletion at every position across 
the genome and sets of probes designed to match a range of specific mtDNA haplotypes. Using 
long-range polymerase chain reaction, we amplified the 16.6-kb mtDNA directly from genomic DNA 
samples (20). Labeled RNA targets were prepared by in vitro transcription and hybridized to the array. 
Genomic hybridization patterns were imaged in less than 10 min by a high-resolution confocal scanner 
(21). 
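
The probe layout described above (four substitution probes plus a single-base deletion probe at every position) can be sketched as follows. This is an illustrative reconstruction, not the design software used for the array. It assumes 25-nucleotide probes with the substitution at position 13, writes probes in the same sense as the reference for readability (physical probes would be complementary to the labeled target), and skips positions that lack full flanking sequence rather than wrapping around the circular genome. The name `probe_set` is hypothetical.

```python
def probe_set(reference, pos, length=25, sub_pos=13):
    """Build the five probes that query reference[pos] (0-based index).

    Returns a dict with keys "A", "C", "G", "T" (substitution probes)
    and "del" (single-base deletion probe, one base shorter).
    """
    left = sub_pos - 1          # bases upstream of the substitution position
    right = length - sub_pos    # bases downstream of it
    if pos - left < 0 or pos + right >= len(reference):
        raise ValueError("position lacks full flanking sequence")
    upstream = reference[pos - left:pos]
    downstream = reference[pos + 1:pos + 1 + right]
    probes = {base: upstream + base + downstream for base in "ACGT"}
    probes["del"] = upstream + downstream
    return probes


# Toy reference; real use would pass the 16,569-bp mtDNA sequence.
toy_reference = "ACGT" * 10
for name, seq in probe_set(toy_reference, 20).items():
    print(name, len(seq), seq)
```

Exactly one of the four substitution probes reproduces the reference window, and the deletion probe is 24 bases long, matching the five-probe columns described for the array.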

Fig. 3. Human mitochondrial genome on a chip. (A) An image of the array hybridized to 16.6 kb of 
mitochondrial target RNA (L strand). The 16,569-bp map of the genome is shown, and the H strand 
origin of replication (O_H), located in the control region, is indicated. (B) A portion of the hybridization 
pattern magnified. In each column there are five probes: A, C, G, T, and Δ, from top to bottom. The 
Δ probe has a single-base deletion instead of a substitution and hence is 24 instead of 25 bases in length. 
The scale is indicated by the bar beneath the image. Although there is considerable sequence-dependent 
intensity variation, most of the array can be read directly. The image was collected at a resolution of 
~100 pixels per probe cell. (C) The ability of the array to detect and read single-base differences in a 
16.6-kb sample is illustrated. Two different target sequences were hybridized in parallel to different 
chips. The hybridization patterns are compared for four different positions in the sequence. Only the 
P25,13 probes are shown. The top panel of each pair shows the hybridization of the mt3 target, which 
matches the chip P° sequence at these positions. The lower panel shows the pattern generated by a 
sample from a patient with Leber's hereditary optic neuropathy (LHON). Three known pathogenic 
mutations, LHON3460, LHON4216, and LHON13708, are clearly detected. For comparison, the fourth 
panel in the set shows a region around position 11,778 that is identical in both samples. 




The hybridization pattern of a 16.6-kb target to the mitochondrial genome chip is shown in Fig. 3A. 
Although there are some regions of low intensity, most of the 25-nucleotide oligomer array hybridized 
efficiently: Simply by identifying the highest intensity in each column of four substitution probes, 99.0% 
of the mt3 sequence could be read correctly (P° sequence = mt3). The array was used to successfully 
detect three disease-causing mutations in a mtDNA sample from a patient with Leber's hereditary optic 
neuropathy (22, 23) (Fig. 3C). In addition, we detected a total of seven errors and new polymorphisms 
from previously unsequenced regions. 
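
Reading the sequence by taking the highest intensity in each column of four substitution probes amounts to a per-column maximum. The sketch below is a minimal illustration under assumed inputs: each column is a dict mapping the probe's substitution base to a hybridization intensity, and the target base is reported as the complement of the brightest probe, as described in the Fig. 1 legend. The name `call_bases` and the data layout are assumptions, and the sketch makes no attempt at the footprint checks or ambiguity flagging used in the full analysis.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_bases(columns):
    """Call one target base per column from substitution-probe intensities.

    columns: list of dicts mapping probe substitution base -> intensity.
    Returns the target sequence as a string.
    """
    called = []
    for column in columns:
        # The brightest of the four substitution probes is taken as the perfect match.
        brightest = max("ACGT", key=lambda base: column[base])
        called.append(COMPLEMENT[brightest])
    return "".join(called)


# First column uses the counts quoted in the Fig. 1 legend (G probe brightest);
# the second column is made up for illustration.
columns = [
    {"A": 5, "C": 3, "G": 37, "T": 4},
    {"A": 2, "C": 1, "G": 3, "T": 40},
]
print(call_bases(columns))
```

A production caller would also need a rule for near-ties between probes, which is where a probabilistic method such as the Bayesian algorithm cited above (16) becomes necessary.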

We then hybridized 10 genomes from African individuals to the array and unambiguously identified 
505 polymorphisms. These were polymorphisms that could be clearly read and for which a confirmatory 
footprint was detected automatically. For the 10 samples, the 2.5-kb cytochrome b and control region 
sequences were known (17). No false positives were detected in the ~25 kb of sequence checked in this 
way. Additional clustered polymorphisms were detected by the presence of footprints but not read 
directly. A detailed analysis of the polymorphisms in these genomes, and others, will be presented 
elsewhere. 

The throughput of a conventional gel-based sequencer, with an average read length of 400 nucleotides 
and 48 lanes that is run twice a day, might be two mitochondrial genomes a day at best. In contrast, the 
throughput of the nonoptimized system we describe is five chips per hour. Thus, 50 genomes can be read 
by hybridization in the time it takes to read two genomes conventionally. Furthermore, there are 
significant reductions in sample preparation requirements because the entire genome is labeled in a 
single reaction, so the cost is similar to that for a single sequencing reaction. Also, sequence reading at 
the level of data analysis is automated: The sequences can be read in a matter of minutes. No analytical 
separations or gel preparation is needed, which contributes to the speed of the experiment. Although the 
inability to read all possible sequences is a weakness of the 4L tiled array, it is not a major limitation, 
because in practice the small number of ambiguities can be checked by targeted conventional 
sequencing. In particular, highly repetitive sequences, such as long runs of a single base, are presently 
best analyzed with conventional technology. Finally, a clear advantage to the approach we describe is 
that it is highly scalable. The cost, effort, and time required to analyze the entire 16.6-kb mtDNA in a 
single experiment is virtually identical to that required to read 2.5 kb. This provides a clear path to 
further orders-of-magnitude improvements in efficiency. 
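
The throughput comparison above can be checked with a few lines of arithmetic. The sketch assumes single-pass coverage with no overlap between reads, which gives an upper bound for the gel-based figure, and one genome per chip. The variable names are illustrative.

```python
READ_LENGTH_NT = 400      # average gel read length
LANES = 48                # lanes per gel run
RUNS_PER_DAY = 2          # gel runs per day
GENOME_BP = 16_569        # human mtDNA length (Fig. 3 legend)
CHIPS_PER_HOUR = 5        # throughput of the nonoptimized array system

# Raw gel output per day, and the genomes it would cover at single-pass depth.
gel_nt_per_day = READ_LENGTH_NT * LANES * RUNS_PER_DAY
gel_genomes_per_day = gel_nt_per_day / GENOME_BP

# Time for the array system to read 50 genomes, assuming one genome per chip.
hours_for_50_genomes = 50 / CHIPS_PER_HOUR

print(gel_nt_per_day)                  # 38400
print(round(gel_genomes_per_day, 2))   # 2.32
print(hours_for_50_genomes)            # 10.0
```

The gel figure of roughly two genomes a day is therefore a best case, since any overlap between reads or second-strand confirmation reduces it further.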

High-density oligonucleotide arrays provide the foundation for a powerful genetic analysis technology. 
The method can be used to characterize the spectrum of sequence variation in a population and can be 
applied to the analysis of many genes in parallel. In the case of human mtDNA, we simultaneously 
analyzed the control region, 13 protein coding genes, 22 tRNA genes, and 2 ribosomal RNA genes. The 
methods described here can be applied to other research areas in molecular genetics; for example, the 
ability to identify and sequence polymorphisms provides a basis for genetic mapping. The specificity of 
oligonucleotide hybridization and the scalability of the method suggest the possibility of a dedicated 
array that could be used to generate a high-resolution genetic map of an entire genome in a single 
experiment. Likewise, the concepts and techniques described here have been used to develop approaches 
for mRNA identification and the large-scale, parallel measurement of expression levels (24). Thus, the 
sequence of a gene, its spectrum of change in the population, its chromosomal location, and its dynamics 
of expression (all essential to a full understanding of function) can be determined with high-density 
probe arrays. The challenge now is to synthesize and read probe arrays at even higher density. For 
example, a 2 cm by 2 cm array, synthesized with probes occupying 1-µm synthesis sites in a 4L tiling, 
could query the entire coding content of the human genome, estimated at 100,000 genes. 
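
The capacity of such an array follows from simple arithmetic. The sketch below assumes a square grid of 1-µm sites with no space lost to borders or control probes, four probes per queried base (the 4L tiling), and an even split of queried bases across genes. The variable names are illustrative.

```python
SIDE_UM = 2 * 10_000       # 2 cm expressed in micrometres
SITE_UM = 1                # edge length of one synthesis site
PROBES_PER_BASE = 4        # 4L tiling: A, C, G, T probe for each position
GENES = 100_000            # estimated number of human genes

sites = (SIDE_UM // SITE_UM) ** 2
bases_queried = sites // PROBES_PER_BASE
bases_per_gene = bases_queried // GENES

print(sites)            # 400000000
print(bases_queried)    # 100000000
print(bases_per_gene)   # 1000
```

Under these assumptions the array would query about 1,000 bases of coding sequence per gene.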

L14675-T3 (5'-aattaaccctcactaaagggATTCTCGCACGGACTACAAC) and H667-T7 (U). 

Figure 1. (A) Design of a 4L tiled array. Each position in the target sequence (uppercase letters) is 
queried by a set of four probes on the chip (lowercase letters), identical except at a single position, 
termed the substitution position, which is either A, C, G, or T (blue indicates complementarity, red a 
mismatch). Two sets of probes are shown, querying adjacent positions in the target. (B) Effect of a 
change in the target sequence. The probes are the same as in (A), but the target now contains a 
single-base substitution (base C, shown in green). The probe set querying the changed base still has a 
perfect match (the G probe). However, probes in adjacent sets that overlap the altered target position 
now have either one or two mismatches (red) instead of zero or one, because they were designed to 
match the target shown in (A). (C) Hybridization to a 4L tiled array and detection of a base change in the 
target. The array shown was designed to the mt1 sequence. (Top) Hybridization to mt1. The substitution 
used in each row of probes is indicated to the left of the image. The target sequence can be read 5' to 3' 
from left to right as the complement of the substitution base with the brightest signal. With hybridization 
to mt2 (bottom), which differs from mt1 in this region by a T→C transition, the G probe at position 
16,493 is now a perfect match, with the other three probes having single-base mismatches (A 5, C 3, G 
37, T 4 counts). However, at flanking positions, the probes have either single- or double-base 
mismatches, because the mt2 transition now occurs away from the query position. 

Figure 2. Detection of base differences in a 2.5-kb region by comparison of scaled P° hybridization 
intensity patterns between a sample (green) and a reference (red) sequence. (A) Comparison of sequence 
ief007 to mt1. In the region shown, there is a single-base difference between the two sequences, located 
at position 16,223 (C in mt1, T in ief007). This results in a "footprint" spanning ~20 positions, 11 to the 
left and 8 to the right of position 16,223, in which the ief007 P° intensities are decreased by a factor of 
more than 10 on average relative to the mt1 intensities. The predicted footprint location is indicated by 
the gray bar, and the location of the polymorphism is shown by a vertical black line within the bar. The 
size of a footprint changes with probe length, and its relative position with substitution position (not 
shown). (B) Comparison of sequence ha001 to mt1. The ha001 target has four polymorphisms relative to 
mt1. The P° intensity pattern clearly shows two regions of difference between the targets. Each region 
contains two or more differences, because in both cases the footprints are longer than 20 positions and 
therefore are too extensive to be explained by a single-base difference. The effect of competition can be 
seen by comparing the mt1 intensities in the ief007 and ha001 experiments: The relative intensities of 
mt1 are greater in (B) where ha001 contains P° mismatches but ief007 does not. (C) The ha004 
sequence has multiple differences to mt1, resulting in a complex pattern extending over most of the 
region shown. Thus, differences are clearly detected. Because hybridization intensities are extremely 
sequence-dependent, each of the mitochondrial sequences can also be identified simply by its 
hybridization pattern. 